Tabular Data & R Overview

Erik Fredner

2024-08-28

Tabular Data

  • Tabular data is a common data structure.
  • It is a table (“table” → “tabular”) with rows and columns.
  • Rows contain observations.
  • Columns contain features or variables.

Excel is tabular

Table with dog breed, height, weight

Observations (dogs) go in rows

Dogs with highlighted row.

Features (breed, height, weight) go in columns

Dogs with highlighted column.

Tibbles are tabular

library(tidyverse) # provides tibble(), select(), and filter()

breed <- c("Shih-Tzu", "Labrador", "Beagle", "Newfoundland", "Chihuahua", "Affenpinscher")
weight <- c(5.5, 33, 10.2, 70, 1.3, 9.6)
height <- c(24, 56, 34, 69, 20, 27)
dogs <- tibble(breed, weight, height)
dogs
# A tibble: 6 × 3
  breed         weight height
  <chr>          <dbl>  <dbl>
1 Shih-Tzu         5.5     24
2 Labrador        33       56
3 Beagle          10.2     34
4 Newfoundland    70       69
5 Chihuahua        1.3     20
6 Affenpinscher    9.6     27

One feature: breed

dogs |>
  select(breed)
# A tibble: 6 × 1
  breed        
  <chr>        
1 Shih-Tzu     
2 Labrador     
3 Beagle       
4 Newfoundland 
5 Chihuahua    
6 Affenpinscher

One observation: "Beagle"

dogs |>
  filter(breed == "Beagle")
# A tibble: 1 × 3
  breed  weight height
  <chr>   <dbl>  <dbl>
1 Beagle   10.2     34
  • == is a comparison operator that means “is equal to.”
  • We’re saying, “Filter this tibble for rows where breed is equal to "Beagle".”
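
The same pattern works with other comparison operators. As a minimal sketch (assuming the tibble and dplyr packages are installed; heavy_dogs is a name made up for this example), this keeps only the rows where weight is greater than 10:

```r
library(tibble)
library(dplyr)

# rebuild the dogs table from the earlier slide
dogs <- tibble(
  breed = c("Shih-Tzu", "Labrador", "Beagle", "Newfoundland", "Chihuahua", "Affenpinscher"),
  weight = c(5.5, 33, 10.2, 70, 1.3, 9.6),
  height = c(24, 56, 34, 69, 20, 27)
)

# keep only the rows where weight is greater than 10
heavy_dogs <- dogs |>
  filter(weight > 10)

heavy_dogs
```

Three rows pass the filter: Labrador, Beagle, and Newfoundland.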

Data types

  • Every feature (column) has a data type.
  • Two common data types in the dogs table:
    • character (text, e.g., "Beagle")
    • numeric (numbers, e.g., 10.2)
  • Each feature has a single data type.

Data types in dogs

dogs
# A tibble: 6 × 3
  breed         weight height
  <chr>          <dbl>  <dbl>
1 Shih-Tzu         5.5     24
2 Labrador        33       56
3 Beagle          10.2     34
4 Newfoundland    70       69
5 Chihuahua        1.3     20
6 Affenpinscher    9.6     27
  • chr is short for character
  • dbl is short for double (a type of numeric)
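
One way to check a data type yourself is with class() and typeof(). The sketch below uses base R only. It also shows what happens if text and numbers are mixed in one vector: R converts everything to character so that the vector still has a single type. The name mixed is made up for this example.

```r
# a vector of numbers is numeric, stored as double
weight <- c(5.5, 33, 10.2)
class(weight)
typeof(weight)

# mixing text and numbers turns every value into character
mixed <- c("Beagle", 10.2)
class(mixed)
mixed
```

Here class(weight) is "numeric", typeof(weight) is "double", and mixed holds the two strings "Beagle" and "10.2".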

Takeaways

  • Good data structure is the most important thing we will learn in this class.
  • Real-world data science is largely about gathering, cleaning, and organizing data.
  • Don’t take good data structure for granted!

Introduction to some key concepts in R

Objects

  • Everything in R is an object.
  • We create new objects by assigning values to names:
almost_pi <- 3.14
almost_pi <- 3.1415
almost_pi <- 3.141592653589793238462643383279
# R prints 7 significant digits by default, so the printed value looks rounded:
almost_pi
[1] 3.141593

Functions

Functions take inputs and generate outputs.

# sqrt is a function that takes a number and returns its square root
sqrt(9)
[1] 3

Comparison operators

Comparison operators compare two values and return either TRUE or FALSE.

# Is the square root of 9 equal to 3?
sqrt(9) == 3
[1] TRUE
# Is the square root of 10 less than 3?
sqrt(10) < 3
[1] FALSE
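
A few more comparison operators, as a small base R sketch. The names not_three, at_least_four, and over_ten are made up for this example. The last comparison shows that an operator applied to a vector is checked element by element.

```r
# != means "is not equal to"
not_three <- sqrt(9) != 3
not_three

# >= means "is greater than or equal to"
at_least_four <- sqrt(16) >= 4
at_least_four

# comparing a vector to one number returns one TRUE or FALSE per element
weight <- c(5.5, 33, 10.2)
over_ten <- weight > 10
over_ten
```

These return FALSE, TRUE, and the vector FALSE TRUE TRUE, respectively.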

Functions take arguments

  • Arguments may be named. Unnamed arguments are matched by position.
  • Named arguments may be given in any order.
# round is a function that takes a numeric vector and, by default, rounds it to the nearest integer
round(almost_pi)
[1] 3
round(almost_pi, digits = 6)
[1] 3.141593
# this is the same as above because digits is the second argument
round(almost_pi, 6)
[1] 3.141593
# this is the same as above, but because we name both arguments, we can invert the order:
round(digits = 6, x = almost_pi)
[1] 3.141593
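
The same rules apply to other functions. As a sketch, seq() builds a sequence of numbers, and its first three arguments are from, to, and by. The names by_position and by_name are made up for this example.

```r
# unnamed arguments are matched by position: from = 0, to = 10, by = 5
by_position <- seq(0, 10, 5)
by_position

# named arguments can come in any order
by_name <- seq(by = 5, to = 10, from = 0)

# both calls produce the same sequence
identical(by_position, by_name)
```

Both calls return the sequence 0, 5, 10. Naming arguments takes more typing, but it protects against accidentally swapping two values.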

Pipes

  • Pipes |> chain functions together.
  • They pass the output of one function to the first argument of the next function (often named x).
  • They improve readability and reduce the need for intermediate objects.
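
To make the second point concrete, here is a minimal sketch showing that a piped call and an ordinary call are the same thing written two ways. It assumes R 4.1 or later, which is when the |> pipe was added. The names direct and piped are made up for this example.

```r
# an ordinary call: the number is the first argument of round()
direct <- round(3.14159, digits = 2)

# a piped call: the left-hand side becomes the first argument of round()
piped <- 3.14159 |> round(digits = 2)

# the two results are the same
identical(direct, piped)
```

Both lines compute 3.14, so identical() returns TRUE.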

Why do pipes improve readability?

# this is hard to read
abs(tan(log(exp(8), base = 2)))
[1] 1.645831
# this is annoying to write
temp <- exp(8)
temp <- log(temp, base = 2)
temp <- tan(temp)
temp <- abs(temp)
temp
[1] 1.645831
# this is easier to read and easier to write
8 |>
  exp() |>
  log(base = 2) |>
  tan() |>
  abs()
[1] 1.645831

Formatting code

  • Well formatted code is not just nice.
    • It’s essential for sharing your code.
  • These are not my opinions. They’re from the tidyverse style guide.
  • Some rules I will enforce:

Spacing around operators

Put spaces before and after most operators (e.g., <-, +, /, ==).

# bad spacing is hard to read
bad<-1+2/3

# good spacing
good <- 1 + 2 / 3

But as in English prose, no space before a comma:

# bad spacing is unnatural
bad <- round(almost_pi , 2)

# good spacing
good <- round(almost_pi, 2)

Pipe spacing

Pipes |> require vertical and horizontal spacing:

# bad spacing
8|>exp()|>log(base=2)|>tan()|>abs()
[1] 1.645831
# good spacing: note the indentation (two spaces)
8 |>
  exp() |>
  log(base = 2) |>
  tan() |>
  abs()
[1] 1.645831

Naming stuff

Naming objects is hard! Bad names make things more difficult for everyone.

Rules for names:

  1. Use lowercase letters, numbers, and underscores.
  2. Use snake case (e.g., snake_case).
  3. Write the shortest, clearest name you can.

Naming stuff: examples

  • Horrible: WeightOfDogInKilograms
  • Bad: weight_of_dog_in_kg
  • Okay: dog_weight
  • Best: weight

The last and best option is only available if your data is structured correctly!
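
A rough sketch of why structure makes short names possible (assuming the tibble and dplyr packages are installed). The cats table and all of its values are made up for illustration. Because each table already says which animal it describes, both can use the short column name weight without confusion.

```r
library(tibble)
library(dplyr)

# two dogs from the earlier table
dogs <- tibble(
  breed = c("Beagle", "Labrador"),
  weight = c(10.2, 33)
)

# a hypothetical cats table with made-up values
cats <- tibble(
  breed = c("Siamese", "Maine Coon"),
  weight = c(4, 8)
)

# the table name supplies the context, so the column can just be "weight"
heavy_dogs <- dogs |>
  filter(weight > 10)

heavy_cats <- cats |>
  filter(weight > 5)

heavy_dogs
heavy_cats
```

Without the tables, the same values would need longer names like dog_weight and cat_weight to stay distinguishable.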

Running code in .Rmd files

There are many ways to run code in an .Rmd file:

  • Click the green play button in the RStudio script editor.
  • Use the Command Palette (Ctrl + Shift + P on Windows, ⌘ + Shift + P on macOS) and search for “Run the current code chunk.”
  • Keyboard shortcuts: Ctrl + Shift + Enter (Windows) or ⌘ + Shift + Enter (macOS)
  • Click the Knit button to render the entire document as an HTML file.